Project: Investigate Attendance of Medical Appointments in Brazil

Table of Contents

Introduction

The dataset contains 110,527 medical appointments of patients in various neighbourhoods in Brazil. It contains 14 observations that contain patient data. The last observation records whether or not, the patient attended the appointment.

Dataset Description

The observations are as follows:

  • Patient Id : A unique patient identifier
  • Appointment Id : A unique identifier for each appointment
  • Gender: M or F to denote Male or Female
  • Scheduled Day: Date some registered the appointment
  • Appointment Day: Date patient is required to attend the appointment
  • Age: Age of the patient
  • Neighbourhood: Location of the appointment
  • Scholarship: True or False to denote whether the patient is a recipient of the Bolsa welfare program
  • Hipertension: True or False
  • Diabetes: True or False
  • Alcoholism: True or False
  • Handicap: True or False
  • SMS_received: Denotes whether patient a message through SMS prior to the appointment
  • No-show: True or False

Question(s) for Analysis

  1. Which gender has the highest number of missed appointments?
  2. Are patients who are recipients of welfare likely to attend appointments?
  3. What day of the week has the most missed attendance?
  4. Does the number of waiting days affect attendance?
In [1]:
#Install missing packages using pip
#Restart Kernel afterwards
!pip install geopy;
!pip install --upgrade plotly;
!pip install --upgrade pyopenssl;
!pip install --upgrade certifi;
!pip install --upgrade pandas==1.1.5;
!pip install --upgrade tensorflow-tensorboard==0.1.1;
!pip --disable-pip-version-check install requests;
Requirement already satisfied: geopy in /opt/conda/lib/python3.6/site-packages (2.2.0)
Requirement already satisfied: geographiclib<2,>=1.49 in /opt/conda/lib/python3.6/site-packages (from geopy) (1.52)
Requirement already up-to-date: plotly in /opt/conda/lib/python3.6/site-packages (5.8.0)
Requirement already satisfied, skipping upgrade: tenacity>=6.2.0 in /opt/conda/lib/python3.6/site-packages (from plotly) (8.0.1)
Collecting pyopenssl
  Using cached https://files.pythonhosted.org/packages/d5/9f/9c0e3288b85f907a008f9d31318b0e4de31b2f67724a8745e633741f609c/pyOpenSSL-22.0.0-py2.py3-none-any.whl
Collecting cryptography>=35.0 (from pyopenssl)
  Using cached https://files.pythonhosted.org/packages/51/05/bb2b681f6a77276fc423d04187c39dafdb65b799c8d87b62ca82659f9ead/cryptography-37.0.2.tar.gz
  Installing build dependencies ... done
Requirement already satisfied, skipping upgrade: cffi>=1.12 in /opt/conda/lib/python3.6/site-packages (from cryptography>=35.0->pyopenssl) (1.15.0)
Requirement already satisfied, skipping upgrade: pycparser in /opt/conda/lib/python3.6/site-packages (from cffi>=1.12->cryptography>=35.0->pyopenssl) (2.18)
Building wheels for collected packages: cryptography
  Running setup.py bdist_wheel for cryptography ... error
  Complete output from command /opt/conda/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-m8c6q10k/cryptography/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-9vb2v532 --python-tag cp36:
  /opt/conda/lib/python3.6/importlib/__init__.py:126: CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team. Therefore, support for it is deprecated in cryptography and will be removed in a future release.
    return _bootstrap._gcd_import(name[level:], package, level)
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-3.6
  creating build/lib.linux-x86_64-3.6/cryptography
  copying src/cryptography/__about__.py -> build/lib.linux-x86_64-3.6/cryptography
  copying src/cryptography/fernet.py -> build/lib.linux-x86_64-3.6/cryptography
  copying src/cryptography/__init__.py -> build/lib.linux-x86_64-3.6/cryptography
  copying src/cryptography/exceptions.py -> build/lib.linux-x86_64-3.6/cryptography
  copying src/cryptography/utils.py -> build/lib.linux-x86_64-3.6/cryptography
  creating build/lib.linux-x86_64-3.6/cryptography/hazmat
  copying src/cryptography/hazmat/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat
  copying src/cryptography/hazmat/_oid.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat
  creating build/lib.linux-x86_64-3.6/cryptography/x509
  copying src/cryptography/x509/certificate_transparency.py -> build/lib.linux-x86_64-3.6/cryptography/x509
  copying src/cryptography/x509/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/x509
  copying src/cryptography/x509/extensions.py -> build/lib.linux-x86_64-3.6/cryptography/x509
  copying src/cryptography/x509/general_name.py -> build/lib.linux-x86_64-3.6/cryptography/x509
  copying src/cryptography/x509/base.py -> build/lib.linux-x86_64-3.6/cryptography/x509
  copying src/cryptography/x509/ocsp.py -> build/lib.linux-x86_64-3.6/cryptography/x509
  copying src/cryptography/x509/name.py -> build/lib.linux-x86_64-3.6/cryptography/x509
  copying src/cryptography/x509/oid.py -> build/lib.linux-x86_64-3.6/cryptography/x509
  creating build/lib.linux-x86_64-3.6/cryptography/hazmat/backends
  copying src/cryptography/hazmat/backends/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends
  creating build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings
  copying src/cryptography/hazmat/bindings/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings
  creating build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
  copying src/cryptography/hazmat/primitives/poly1305.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
  copying src/cryptography/hazmat/primitives/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
  copying src/cryptography/hazmat/primitives/_serialization.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
  copying src/cryptography/hazmat/primitives/_asymmetric.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
  copying src/cryptography/hazmat/primitives/cmac.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
  copying src/cryptography/hazmat/primitives/hmac.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
  copying src/cryptography/hazmat/primitives/padding.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
  copying src/cryptography/hazmat/primitives/hashes.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
  copying src/cryptography/hazmat/primitives/_cipheralgorithm.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
  copying src/cryptography/hazmat/primitives/keywrap.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
  copying src/cryptography/hazmat/primitives/constant_time.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
  creating build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/poly1305.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/ciphers.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/backend.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/dh.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/decode_asn1.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/x448.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/x25519.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/ed25519.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/ed448.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/encode_asn1.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/rsa.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/aead.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/dsa.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/cmac.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/hmac.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/utils.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/hashes.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/ec.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  copying src/cryptography/hazmat/backends/openssl/x509.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
  creating build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings/openssl
  copying src/cryptography/hazmat/bindings/openssl/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings/openssl
  copying src/cryptography/hazmat/bindings/openssl/binding.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings/openssl
  copying src/cryptography/hazmat/bindings/openssl/_conditional.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings/openssl
  creating build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/serialization
  copying src/cryptography/hazmat/primitives/serialization/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/serialization
  copying src/cryptography/hazmat/primitives/serialization/ssh.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/serialization
  copying src/cryptography/hazmat/primitives/serialization/pkcs7.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/serialization
  copying src/cryptography/hazmat/primitives/serialization/base.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/serialization
  copying src/cryptography/hazmat/primitives/serialization/pkcs12.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/serialization
  creating build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/twofactor
  copying src/cryptography/hazmat/primitives/twofactor/totp.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/twofactor
  copying src/cryptography/hazmat/primitives/twofactor/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/twofactor
  copying src/cryptography/hazmat/primitives/twofactor/hotp.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/twofactor
  creating build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/kdf
  copying src/cryptography/hazmat/primitives/kdf/concatkdf.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/kdf
  copying src/cryptography/hazmat/primitives/kdf/scrypt.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/kdf
  copying src/cryptography/hazmat/primitives/kdf/hkdf.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/kdf
  copying src/cryptography/hazmat/primitives/kdf/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/kdf
  copying src/cryptography/hazmat/primitives/kdf/pbkdf2.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/kdf
  copying src/cryptography/hazmat/primitives/kdf/kbkdf.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/kdf
  copying src/cryptography/hazmat/primitives/kdf/x963kdf.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/kdf
  creating build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/ciphers
  copying src/cryptography/hazmat/primitives/ciphers/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/ciphers
  copying src/cryptography/hazmat/primitives/ciphers/algorithms.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/ciphers
  copying src/cryptography/hazmat/primitives/ciphers/modes.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/ciphers
  copying src/cryptography/hazmat/primitives/ciphers/aead.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/ciphers
  copying src/cryptography/hazmat/primitives/ciphers/base.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/ciphers
  creating build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
  copying src/cryptography/hazmat/primitives/asymmetric/dh.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
  copying src/cryptography/hazmat/primitives/asymmetric/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
  copying src/cryptography/hazmat/primitives/asymmetric/x448.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
  copying src/cryptography/hazmat/primitives/asymmetric/x25519.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
  copying src/cryptography/hazmat/primitives/asymmetric/ed25519.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
  copying src/cryptography/hazmat/primitives/asymmetric/ed448.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
  copying src/cryptography/hazmat/primitives/asymmetric/rsa.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
  copying src/cryptography/hazmat/primitives/asymmetric/dsa.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
  copying src/cryptography/hazmat/primitives/asymmetric/padding.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
  copying src/cryptography/hazmat/primitives/asymmetric/utils.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
  copying src/cryptography/hazmat/primitives/asymmetric/ec.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
  copying src/cryptography/hazmat/primitives/asymmetric/types.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
  running egg_info
  writing src/cryptography.egg-info/PKG-INFO
  writing dependency_links to src/cryptography.egg-info/dependency_links.txt
  writing requirements to src/cryptography.egg-info/requires.txt
  writing top-level names to src/cryptography.egg-info/top_level.txt
  reading manifest file 'src/cryptography.egg-info/SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  no previously-included directories found matching 'docs/_build'
  warning: no previously-included files found matching 'vectors'
  warning: no previously-included files matching '*' found under directory 'vectors'
  warning: no previously-included files matching '*' found under directory '.github'
  warning: no previously-included files found matching 'release.py'
  warning: no previously-included files found matching '.coveragerc'
  warning: no previously-included files found matching 'codecov.yml'
  warning: no previously-included files found matching '.readthedocs.yml'
  warning: no previously-included files found matching 'dev-requirements.txt'
  warning: no previously-included files found matching 'tox.ini'
  warning: no previously-included files found matching 'mypy.ini'
  warning: no previously-included files matching '*' found under directory '.circleci'
  adding license file 'LICENSE'
  adding license file 'LICENSE.APACHE'
  adding license file 'LICENSE.BSD'
  adding license file 'LICENSE.PSF'
  writing manifest file 'src/cryptography.egg-info/SOURCES.txt'
  copying src/cryptography/py.typed -> build/lib.linux-x86_64-3.6/cryptography
  creating build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings/_rust
  copying src/cryptography/hazmat/bindings/_rust/__init__.pyi -> build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings/_rust
  copying src/cryptography/hazmat/bindings/_rust/asn1.pyi -> build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings/_rust
  copying src/cryptography/hazmat/bindings/_rust/ocsp.pyi -> build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings/_rust
  copying src/cryptography/hazmat/bindings/_rust/x509.pyi -> build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings/_rust
  running build_ext
  running build_rust
  
      =============================DEBUG ASSISTANCE=============================
      If you are seeing a compilation error please try the following steps to
      successfully install cryptography:
      1) Upgrade to the latest pip and try again. This will fix errors for most
         users. See: https://pip.pypa.io/en/stable/installing/#upgrading-pip
      2) Read https://cryptography.io/en/latest/installation/ for specific
         instructions for your platform.
      3) Check our frequently asked questions for more information:
         https://cryptography.io/en/latest/faq/
      4) Ensure you have a recent Rust toolchain installed:
         https://cryptography.io/en/latest/installation/#rust
  
      Python: 3.6.3
      platform: Linux-4.15.0-1083-gcp-x86_64-with-debian-stretch-sid
      pip: 18.1
      setuptools: 59.6.0
      setuptools_rust: 1.1.2
      =============================DEBUG ASSISTANCE=============================
  
  error: can't find Rust compiler
  
  If you are using an outdated pip version, it is possible a prebuilt wheel is available for this package but pip is not able to install from it. Installing from the wheel would avoid the need for a Rust compiler.
  
  To update pip, run:
  
      pip install --upgrade pip
  
  and then retry package installation.
  
  If you did intend to build this package from source, try installing a Rust compiler from your system package manager and ensure it is on the PATH during installation. Alternatively, rustup (available at https://rustup.rs) is the recommended way to download and update the Rust compiler toolchain.
  
  This package requires Rust >=1.41.0.
  
  ----------------------------------------
  Failed building wheel for cryptography
  Running setup.py clean for cryptography
  Complete output from command /opt/conda/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-m8c6q10k/cryptography/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" clean --all:
  /opt/conda/lib/python3.6/importlib/__init__.py:126: CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team. Therefore, support for it is deprecated in cryptography and will be removed in a future release.
    return _bootstrap._gcd_import(name[level:], package, level)
  running clean
  removing 'build/lib.linux-x86_64-3.6' (and everything under it)
  'build/bdist.linux-x86_64' does not exist -- can't clean it
  'build/scripts-3.6' does not exist -- can't clean it
  removing 'build'
  running clean_rust
  
      =============================DEBUG ASSISTANCE=============================
      If you are seeing a compilation error please try the following steps to
      successfully install cryptography:
      1) Upgrade to the latest pip and try again. This will fix errors for most
         users. See: https://pip.pypa.io/en/stable/installing/#upgrading-pip
      2) Read https://cryptography.io/en/latest/installation/ for specific
         instructions for your platform.
      3) Check our frequently asked questions for more information:
         https://cryptography.io/en/latest/faq/
      4) Ensure you have a recent Rust toolchain installed:
         https://cryptography.io/en/latest/installation/#rust
  
      Python: 3.6.3
      platform: Linux-4.15.0-1083-gcp-x86_64-with-debian-stretch-sid
      pip: 18.1
      setuptools: 59.6.0
      setuptools_rust: 1.1.2
      =============================DEBUG ASSISTANCE=============================
  
  error: can't find Rust compiler
  
  If you are using an outdated pip version, it is possible a prebuilt wheel is available for this package but pip is not able to install from it. Installing from the wheel would avoid the need for a Rust compiler.
  
  To update pip, run:
  
      pip install --upgrade pip
  
  and then retry package installation.
  
  If you did intend to build this package from source, try installing a Rust compiler from your system package manager and ensure it is on the PATH during installation. Alternatively, rustup (available at https://rustup.rs) is the recommended way to download and update the Rust compiler toolchain.
  
  This package requires Rust >=1.41.0.
  
  ----------------------------------------
  Failed cleaning build dir for cryptography
Failed to build cryptography
Installing collected packages: cryptography, pyopenssl
  Found existing installation: cryptography 2.1.4
    Uninstalling cryptography-2.1.4:
      Successfully uninstalled cryptography-2.1.4
  Running setup.py install for cryptography ... error
    Complete output from command /opt/conda/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-m8c6q10k/cryptography/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-13dq9inc/install-record.txt --single-version-externally-managed --compile:
    /opt/conda/lib/python3.6/importlib/__init__.py:126: CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team. Therefore, support for it is deprecated in cryptography and will be removed in a future release.
      return _bootstrap._gcd_import(name[level:], package, level)
    running install
    /tmp/pip-build-env-oe3v0z_a/lib/python3.6/site-packages/setuptools/command/install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
      setuptools.SetuptoolsDeprecationWarning,
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.6
    creating build/lib.linux-x86_64-3.6/cryptography
    copying src/cryptography/__about__.py -> build/lib.linux-x86_64-3.6/cryptography
    copying src/cryptography/fernet.py -> build/lib.linux-x86_64-3.6/cryptography
    copying src/cryptography/__init__.py -> build/lib.linux-x86_64-3.6/cryptography
    copying src/cryptography/exceptions.py -> build/lib.linux-x86_64-3.6/cryptography
    copying src/cryptography/utils.py -> build/lib.linux-x86_64-3.6/cryptography
    creating build/lib.linux-x86_64-3.6/cryptography/hazmat
    copying src/cryptography/hazmat/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat
    copying src/cryptography/hazmat/_oid.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat
    creating build/lib.linux-x86_64-3.6/cryptography/x509
    copying src/cryptography/x509/certificate_transparency.py -> build/lib.linux-x86_64-3.6/cryptography/x509
    copying src/cryptography/x509/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/x509
    copying src/cryptography/x509/extensions.py -> build/lib.linux-x86_64-3.6/cryptography/x509
    copying src/cryptography/x509/general_name.py -> build/lib.linux-x86_64-3.6/cryptography/x509
    copying src/cryptography/x509/base.py -> build/lib.linux-x86_64-3.6/cryptography/x509
    copying src/cryptography/x509/ocsp.py -> build/lib.linux-x86_64-3.6/cryptography/x509
    copying src/cryptography/x509/name.py -> build/lib.linux-x86_64-3.6/cryptography/x509
    copying src/cryptography/x509/oid.py -> build/lib.linux-x86_64-3.6/cryptography/x509
    creating build/lib.linux-x86_64-3.6/cryptography/hazmat/backends
    copying src/cryptography/hazmat/backends/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends
    creating build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings
    copying src/cryptography/hazmat/bindings/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings
    creating build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
    copying src/cryptography/hazmat/primitives/poly1305.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
    copying src/cryptography/hazmat/primitives/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
    copying src/cryptography/hazmat/primitives/_serialization.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
    copying src/cryptography/hazmat/primitives/_asymmetric.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
    copying src/cryptography/hazmat/primitives/cmac.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
    copying src/cryptography/hazmat/primitives/hmac.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
    copying src/cryptography/hazmat/primitives/padding.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
    copying src/cryptography/hazmat/primitives/hashes.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
    copying src/cryptography/hazmat/primitives/_cipheralgorithm.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
    copying src/cryptography/hazmat/primitives/keywrap.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
    copying src/cryptography/hazmat/primitives/constant_time.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives
    creating build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/poly1305.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/ciphers.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/backend.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/dh.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/decode_asn1.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/x448.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/x25519.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/ed25519.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/ed448.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/encode_asn1.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/rsa.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/aead.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/dsa.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/cmac.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/hmac.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/utils.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/hashes.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/ec.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    copying src/cryptography/hazmat/backends/openssl/x509.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/backends/openssl
    creating build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings/openssl
    copying src/cryptography/hazmat/bindings/openssl/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings/openssl
    copying src/cryptography/hazmat/bindings/openssl/binding.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings/openssl
    copying src/cryptography/hazmat/bindings/openssl/_conditional.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings/openssl
    creating build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/serialization
    copying src/cryptography/hazmat/primitives/serialization/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/serialization
    copying src/cryptography/hazmat/primitives/serialization/ssh.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/serialization
    copying src/cryptography/hazmat/primitives/serialization/pkcs7.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/serialization
    copying src/cryptography/hazmat/primitives/serialization/base.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/serialization
    copying src/cryptography/hazmat/primitives/serialization/pkcs12.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/serialization
    creating build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/twofactor
    copying src/cryptography/hazmat/primitives/twofactor/totp.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/twofactor
    copying src/cryptography/hazmat/primitives/twofactor/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/twofactor
    copying src/cryptography/hazmat/primitives/twofactor/hotp.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/twofactor
    creating build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/kdf
    copying src/cryptography/hazmat/primitives/kdf/concatkdf.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/kdf
    copying src/cryptography/hazmat/primitives/kdf/scrypt.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/kdf
    copying src/cryptography/hazmat/primitives/kdf/hkdf.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/kdf
    copying src/cryptography/hazmat/primitives/kdf/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/kdf
    copying src/cryptography/hazmat/primitives/kdf/pbkdf2.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/kdf
    copying src/cryptography/hazmat/primitives/kdf/kbkdf.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/kdf
    copying src/cryptography/hazmat/primitives/kdf/x963kdf.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/kdf
    creating build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/ciphers
    copying src/cryptography/hazmat/primitives/ciphers/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/ciphers
    copying src/cryptography/hazmat/primitives/ciphers/algorithms.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/ciphers
    copying src/cryptography/hazmat/primitives/ciphers/modes.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/ciphers
    copying src/cryptography/hazmat/primitives/ciphers/aead.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/ciphers
    copying src/cryptography/hazmat/primitives/ciphers/base.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/ciphers
    creating build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
    copying src/cryptography/hazmat/primitives/asymmetric/dh.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
    copying src/cryptography/hazmat/primitives/asymmetric/__init__.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
    copying src/cryptography/hazmat/primitives/asymmetric/x448.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
    copying src/cryptography/hazmat/primitives/asymmetric/x25519.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
    copying src/cryptography/hazmat/primitives/asymmetric/ed25519.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
    copying src/cryptography/hazmat/primitives/asymmetric/ed448.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
    copying src/cryptography/hazmat/primitives/asymmetric/rsa.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
    copying src/cryptography/hazmat/primitives/asymmetric/dsa.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
    copying src/cryptography/hazmat/primitives/asymmetric/padding.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
    copying src/cryptography/hazmat/primitives/asymmetric/utils.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
    copying src/cryptography/hazmat/primitives/asymmetric/ec.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
    copying src/cryptography/hazmat/primitives/asymmetric/types.py -> build/lib.linux-x86_64-3.6/cryptography/hazmat/primitives/asymmetric
    running egg_info
    writing src/cryptography.egg-info/PKG-INFO
    writing dependency_links to src/cryptography.egg-info/dependency_links.txt
    writing requirements to src/cryptography.egg-info/requires.txt
    writing top-level names to src/cryptography.egg-info/top_level.txt
    reading manifest file 'src/cryptography.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    no previously-included directories found matching 'docs/_build'
    warning: no previously-included files found matching 'vectors'
    warning: no previously-included files matching '*' found under directory 'vectors'
    warning: no previously-included files matching '*' found under directory '.github'
    warning: no previously-included files found matching 'release.py'
    warning: no previously-included files found matching '.coveragerc'
    warning: no previously-included files found matching 'codecov.yml'
    warning: no previously-included files found matching '.readthedocs.yml'
    warning: no previously-included files found matching 'dev-requirements.txt'
    warning: no previously-included files found matching 'tox.ini'
    warning: no previously-included files found matching 'mypy.ini'
    warning: no previously-included files matching '*' found under directory '.circleci'
    adding license file 'LICENSE'
    adding license file 'LICENSE.APACHE'
    adding license file 'LICENSE.BSD'
    adding license file 'LICENSE.PSF'
    writing manifest file 'src/cryptography.egg-info/SOURCES.txt'
    copying src/cryptography/py.typed -> build/lib.linux-x86_64-3.6/cryptography
    creating build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings/_rust
    copying src/cryptography/hazmat/bindings/_rust/__init__.pyi -> build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings/_rust
    copying src/cryptography/hazmat/bindings/_rust/asn1.pyi -> build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings/_rust
    copying src/cryptography/hazmat/bindings/_rust/ocsp.pyi -> build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings/_rust
    copying src/cryptography/hazmat/bindings/_rust/x509.pyi -> build/lib.linux-x86_64-3.6/cryptography/hazmat/bindings/_rust
    running build_ext
    running build_rust
    
        =============================DEBUG ASSISTANCE=============================
        If you are seeing a compilation error please try the following steps to
        successfully install cryptography:
        1) Upgrade to the latest pip and try again. This will fix errors for most
           users. See: https://pip.pypa.io/en/stable/installing/#upgrading-pip
        2) Read https://cryptography.io/en/latest/installation/ for specific
           instructions for your platform.
        3) Check our frequently asked questions for more information:
           https://cryptography.io/en/latest/faq/
        4) Ensure you have a recent Rust toolchain installed:
           https://cryptography.io/en/latest/installation/#rust
    
        Python: 3.6.3
        platform: Linux-4.15.0-1083-gcp-x86_64-with-debian-stretch-sid
        pip: 18.1
        setuptools: 59.6.0
        setuptools_rust: 1.1.2
        =============================DEBUG ASSISTANCE=============================
    
    error: can't find Rust compiler
    
    If you are using an outdated pip version, it is possible a prebuilt wheel is available for this package but pip is not able to install from it. Installing from the wheel would avoid the need for a Rust compiler.
    
    To update pip, run:
    
        pip install --upgrade pip
    
    and then retry package installation.
    
    If you did intend to build this package from source, try installing a Rust compiler from your system package manager and ensure it is on the PATH during installation. Alternatively, rustup (available at https://rustup.rs) is the recommended way to download and update the Rust compiler toolchain.
    
    This package requires Rust >=1.41.0.
    
    ----------------------------------------
  Rolling back uninstall of cryptography
Command "/opt/conda/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-m8c6q10k/cryptography/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-13dq9inc/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-install-m8c6q10k/cryptography/
Requirement already up-to-date: certifi in /opt/conda/lib/python3.6/site-packages (2022.5.18.1)
Requirement already up-to-date: pandas==1.1.5 in /opt/conda/lib/python3.6/site-packages (1.1.5)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.7.3 in /opt/conda/lib/python3.6/site-packages (from pandas==1.1.5) (2.8.2)
Requirement already satisfied, skipping upgrade: pytz>=2017.2 in /opt/conda/lib/python3.6/site-packages (from pandas==1.1.5) (2017.3)
Requirement already satisfied, skipping upgrade: numpy>=1.15.4 in /opt/conda/lib/python3.6/site-packages (from pandas==1.1.5) (1.19.5)
Requirement already satisfied, skipping upgrade: six>=1.5 in /opt/conda/lib/python3.6/site-packages (from python-dateutil>=2.7.3->pandas==1.1.5) (1.11.0)
Collecting tensorflow-tensorboard==0.1.1
  Using cached https://files.pythonhosted.org/packages/7c/69/35b5fd3571a32faa0121cf79cb6acd33156de000f60a74cbdd1e77a5bf9b/tensorflow_tensorboard-0.1.1-py3-none-any.whl
Collecting markdown==2.2.0 (from tensorflow-tensorboard==0.1.1)
Requirement already satisfied, skipping upgrade: numpy>=1.11.0 in /opt/conda/lib/python3.6/site-packages (from tensorflow-tensorboard==0.1.1) (1.19.5)
Requirement already satisfied, skipping upgrade: wheel>=0.26 in /opt/conda/lib/python3.6/site-packages (from tensorflow-tensorboard==0.1.1) (0.30.0)
Requirement already satisfied, skipping upgrade: werkzeug>=0.11.10 in /opt/conda/lib/python3.6/site-packages (from tensorflow-tensorboard==0.1.1) (0.14.1)
Requirement already satisfied, skipping upgrade: bleach==1.5.0 in /opt/conda/lib/python3.6/site-packages (from tensorflow-tensorboard==0.1.1) (1.5.0)
Requirement already satisfied, skipping upgrade: protobuf>=3.2.0 in /opt/conda/lib/python3.6/site-packages (from tensorflow-tensorboard==0.1.1) (3.5.1)
Requirement already satisfied, skipping upgrade: six>=1.10.0 in /opt/conda/lib/python3.6/site-packages (from tensorflow-tensorboard==0.1.1) (1.11.0)
Requirement already satisfied, skipping upgrade: html5lib==0.9999999 in /opt/conda/lib/python3.6/site-packages (from tensorflow-tensorboard==0.1.1) (0.9999999)
Requirement already satisfied, skipping upgrade: setuptools in /opt/conda/lib/python3.6/site-packages (from protobuf>=3.2.0->tensorflow-tensorboard==0.1.1) (38.4.0)
Installing collected packages: markdown, tensorflow-tensorboard
  Found existing installation: Markdown 2.6.9
Cannot uninstall 'Markdown'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
Requirement already satisfied: requests in /opt/conda/lib/python3.6/site-packages (2.18.4)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/conda/lib/python3.6/site-packages (from requests) (3.0.4)
Requirement already satisfied: idna<2.7,>=2.5 in /opt/conda/lib/python3.6/site-packages (from requests) (2.6)
Requirement already satisfied: urllib3<1.23,>=1.21.1 in /opt/conda/lib/python3.6/site-packages (from requests) (1.22)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.6/site-packages (from requests) (2022.5.18.1)
In [2]:
import warnings 
warnings.simplefilter(action = 'ignore', category=FutureWarning) #removes warnings about future deprecations in libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopy 
import numpy as np
import plotly

%matplotlib inline
In [3]:
print("Pandas version " + pd.__version__) #requires minimum pandas 1.1.5
print("Geopy version " + geopy.__version__)
print("Plotly version " + plotly.__version__)
Pandas version 1.1.5
Geopy version 2.2.0
Plotly version 5.8.0

Data Wrangling

General Properties

In [4]:
dataset = ('./Database_No_show_appointments/noshowappointments-kagglev2-may-2016.csv')
df = pd.read_csv(dataset)
df.head()
Out[4]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show
0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 0 0 0 0 No
1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 0 0 0 0 No
2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 0 0 0 0 No
3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0 0 0 0 0 No
4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1 1 0 0 0 No
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB
  • ScheduledDay and AppointmentDay columns are objects and not in datetime format

Check for missing values

In [6]:
datatype = df.dtypes #checks the datatype of the column
sum_na = df.isna().sum() #the sum of any missing values in each column
na_ = df.isna().any() #checks for any missing values
info = pd.concat([na_,sum_na,datatype], axis=1, keys=['na_', 'sum_na','datatype'])
print(info)
                  na_  sum_na datatype
PatientId       False       0  float64
AppointmentID   False       0    int64
Gender          False       0   object
ScheduledDay    False       0   object
AppointmentDay  False       0   object
Age             False       0    int64
Neighbourhood   False       0   object
Scholarship     False       0    int64
Hipertension    False       0    int64
Diabetes        False       0    int64
Alcoholism      False       0    int64
Handcap         False       0    int64
SMS_received    False       0    int64
No-show         False       0   object
  • There are no missing values in the dataset

Check for duplicate records

In [7]:
duplicates = df.duplicated().sum()
print("There are {} duplicate records in the dataset ".format(duplicates))
There are 0 duplicate records in the dataset 

Check the number of categories in each feature

In [8]:
for feature in df.columns:
    print('{} has total {} categories \n'
          .format(feature,len(df[feature].value_counts())))
PatientId has total 62299 categories 

AppointmentID has total 110527 categories 

Gender has total 2 categories 

ScheduledDay has total 103549 categories 

AppointmentDay has total 27 categories 

Age has total 104 categories 

Neighbourhood has total 81 categories 

Scholarship has total 2 categories 

Hipertension has total 2 categories 

Diabetes has total 2 categories 

Alcoholism has total 2 categories 

Handcap has total 5 categories 

SMS_received has total 2 categories 

No-show has total 2 categories 

  • The handcap columns has 5 categories, instead of 2 categories (0,1) to represent True or False in the category.
  • The number of Patient ID's indicates, some patients had more than one appointment.

Check the distribution of the features by plotting bar graphs

In [9]:
features_numeric = ['Age','Scholarship','Hipertension','Diabetes','Alcoholism','Handcap','SMS_received']
In [10]:
df[features_numeric].hist(figsize=(15,20));

Check the distribution of No-show column

In [11]:
df['No-show'].value_counts().plot(kind="bar",
                              xlabel='No show',
                              ylabel='Total Count',
                              title='No Show Distribution',figsize=(8,5));
  • Majority of patients in the dataset attended their appointments.

Investigate the Age column

In [12]:
df['Age'].hist(figsize=(10,8), bins=15);
  • Age distribution is skewed to the right,with majority of patients being under 60.

Check the unique value in the Age column

In [13]:
df['Age'].unique()
Out[13]:
array([ 62,  56,   8,  76,  23,  39,  21,  19,  30,  29,  22,  28,  54,
        15,  50,  40,  46,   4,  13,  65,  45,  51,  32,  12,  61,  38,
        79,  18,  63,  64,  85,  59,  55,  71,  49,  78,  31,  58,  27,
         6,   2,  11,   7,   0,   3,   1,  69,  68,  60,  67,  36,  10,
        35,  20,  26,  34,  33,  16,  42,   5,  47,  17,  41,  44,  37,
        24,  66,  77,  81,  70,  53,  75,  73,  52,  74,  43,  89,  57,
        14,   9,  48,  83,  72,  25,  80,  87,  88,  84,  82,  90,  94,
        86,  91,  98,  92,  96,  93,  95,  97, 102, 115, 100,  99,  -1])

Query the ages less than 0

In [14]:
df.query('Age <0')
Out[14]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show
99832 4.659432e+14 5775010 F 2016-06-06T08:58:13Z 2016-06-06T00:00:00Z -1 ROMÃO 0 0 0 0 0 0 No

Query the ages equal to 0

In [15]:
df.query('(Age == 0) & (Alcoholism == 1)')
Out[15]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show
  • There is an an entry recorded in the age column as -1 and also 0. We are going to operate under the assumption that -1 represents an unborn child, while 0 represents a baby of less than 1 year.
  • This assumption is supported by the fact that the record does not indicate any pre-existing conditions such as Diabetes and Alcoholism.

Investigate the Handcap column

Check the unique values in the column

In [16]:
df['Handcap'].unique()
Out[16]:
array([0, 1, 2, 3, 4])
  • There are 4 unique values in the column instead of 2.

Check the value counts of the 4 unique values

In [17]:
df['Handcap'].value_counts()
Out[17]:
0    108286
1      2042
2       183
3        13
4         3
Name: Handcap, dtype: int64

Data Cleaning

Correct data in 'Handcap' column to remove values greater than 1, since 0 means handicap and 1 means handicap

The following function will convert any value above 1 to 1, and 0 will remain 0

In [18]:
def correct_handcap(x):
  if x >= 1:
    return 1
  else:
    if x == 0:
      return 0

Apply the correct_handcap() function

In [19]:
df['Handcap'] = df['Handcap'].apply(correct_handcap)
df['Handcap'].value_counts()
Out[19]:
0    108286
1      2241
Name: Handcap, dtype: int64
  • The Handcap column now consists only of values, 0 and 1.

Change ScheduledDay, AppointmentDay datatype to datetime

In [20]:
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])
  • This columns are now in a datetime format

Create Weekday observation

In [21]:
df['App_weekday'] = df['AppointmentDay'].dt.dayofweek
  • This observation will assist in the lambda operation that occurs in the next code cell determining whether the scheduled appointment occured on a weekday or weekend.
In [22]:
df['Part_of _week'] = df['App_weekday'].apply(lambda x:'weekend' if x >= 5 else 'weekday')

Derive names of the days of the week

In [23]:
df['App_dayname'] = df["AppointmentDay"].dt.day_name()

This has created a column containing names of the days of the week, Sunday to Monday.

Obtain the hour from the Scheduled Appointment

In [24]:
df['Schdl_hour'] = df['ScheduledDay'].dt.hour

A column containing the hour extracted from the time and date from the 'ScheduledDay' column has ben created.

Create observation that determines which part of the day, the scheduling appointment occured

The following function will categorise the different hours of the day according to the time

In [25]:
def get_duration (t):
    if t >= 4 and t <= 9:
        return 'Early morning'
    elif t > 9 and t <= 12:
        return ' Morning'
    elif t > 12  and t <=13:
        return 'Noon'
    elif t > 13 and t <= 16:
        return 'Afternoon'
    elif t > 16 and t <=20:
        return 'Evening'
    elif t > 20 and t <= 24:
        return 'Night'
    elif t < 4 :
        return 'Late Night'
In [26]:
df["Part_of_day"] = df['Schdl_hour'].apply(get_duration)

This function uses the hour value derived from the Scheduled Appointmnet time and contained in the Schdl_hour column to classify the time into;

  • Early Morning - (04:01 to 09:00)

  • Morning- (09:01 to 12:00)

  • Noon- (12:00 to 13:00)

  • Afternoon- (13:01 to 16:00)

  • Evening - (16:01 to 20:00)

  • Night - (20:01 to 00:00)

  • Late Night - (00:01 to 04:00)

Get the number of days between the Scheduling Day and Appointment Day

This is achieved by obtaining the difference in the number of days between the 'AppointmentDay' and 'ScheduledDay' and passing the values into the 'waiting_days' column.

In [27]:
df['waiting_days'] = (df['AppointmentDay'].dt.date) - (df['ScheduledDay'].dt.date)
df['waiting_days'] = df['waiting_days'].dt.days

Investigate 'waiting_days' column

In [28]:
df['waiting_days'].describe()
Out[28]:
count    110527.000000
mean         10.183702
std          15.254996
min          -6.000000
25%           0.000000
50%           4.000000
75%          15.000000
max         179.000000
Name: waiting_days, dtype: float64
  • There exists a negative difference in days before the 1st quantile.

Query the data to see which records have a negative value in the 'waiting_days' column

In [29]:
df.query('waiting_days < 0')
Out[29]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show App_weekday Part_of _week App_dayname Schdl_hour Part_of_day waiting_days
27033 7.839273e+12 5679978 M 2016-05-10 10:51:53+00:00 2016-05-09 00:00:00+00:00 38 RESISTÊNCIA 0 0 0 0 1 0 Yes 0 weekday Monday 10 Morning -1
55226 7.896294e+12 5715660 F 2016-05-18 14:50:41+00:00 2016-05-17 00:00:00+00:00 19 SANTO ANTÔNIO 0 0 0 0 1 0 Yes 1 weekday Tuesday 14 Afternoon -1
64175 2.425226e+13 5664962 F 2016-05-05 13:43:58+00:00 2016-05-04 00:00:00+00:00 22 CONSOLAÇÃO 0 0 0 0 0 0 Yes 2 weekday Wednesday 13 Noon -1
71533 9.982316e+14 5686628 F 2016-05-11 13:49:20+00:00 2016-05-05 00:00:00+00:00 81 SANTO ANTÔNIO 0 0 0 0 0 0 Yes 3 weekday Thursday 13 Noon -6
72362 3.787482e+12 5655637 M 2016-05-04 06:50:57+00:00 2016-05-03 00:00:00+00:00 7 TABUAZEIRO 0 0 0 0 0 0 Yes 1 weekday Tuesday 6 Early morning -1
  • There cannot exist a negative time difference since the patients, could not have attended their appointments, and then called in at a later day to book their previous appointment. This 5 records will be dropped from the dataset.

Query the records with a negative day difference to obtain one of the Patient Id's for further investigation.

In [30]:
df.query('waiting_days < 0').iloc[0,0]
Out[30]:
7839272661752.0

Using the specific Patient Id check for other appointments and their corresponding data.

In [31]:
df.query('PatientId == 7839272661752.0')
Out[31]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show App_weekday Part_of _week App_dayname Schdl_hour Part_of_day waiting_days
3370 7.839273e+12 5730318 M 2016-05-24 08:27:43+00:00 2016-05-24 00:00:00+00:00 38 RESISTÊNCIA 0 0 0 0 1 0 No 1 weekday Tuesday 8 Early morning 0
27033 7.839273e+12 5679978 M 2016-05-10 10:51:53+00:00 2016-05-09 00:00:00+00:00 38 RESISTÊNCIA 0 0 0 0 1 0 Yes 0 weekday Monday 10 Morning -1
100002 7.839273e+12 5787285 M 2016-06-08 09:40:13+00:00 2016-06-08 00:00:00+00:00 38 RESISTÊNCIA 0 0 0 0 1 0 No 2 weekday Wednesday 9 Early morning 0
100003 7.839273e+12 5752857 M 2016-05-31 12:56:41+00:00 2016-06-01 00:00:00+00:00 38 RESISTÊNCIA 0 0 0 0 1 0 No 2 weekday Wednesday 12 Morning 1
101919 7.839273e+12 5777702 M 2016-06-06 14:19:28+00:00 2016-06-06 00:00:00+00:00 38 RESISTÊNCIA 0 0 0 0 1 0 No 0 weekday Monday 14 Afternoon 0
  • This specific Patient Id shows that the patient had other multiple appointments that occurred before and after the erroneous appointment recorded using AppointmentID : 5679978. All their other characteristics are similar.
  • Only the appointment with the negative 'waiting_day' difference shows a missed attendance. This is the same for all the records showing a negative difference.

  • This 5 records will be dropped from the dataset.

Use the same condition to drop the 5 records from teh dataset.

In [32]:
df = df[df.waiting_days >= 0]

Query the dataset to confirm the rows have been dropped.

In [33]:
negative_difference = len(df.query('waiting_days < 0'))
print("There are {} records that display a negative value in the 'waiting_days' column.".format(negative_difference))
There are 0 records that display a negative value in the 'waiting_days' column.

Encode the 'No-show' column for better intuition.

In [34]:
df['Attendance'] = df['No-show'].apply(lambda x: 'Attended' if x == 'No' else 'Missed');
In [35]:
df['No-show'].value_counts()
Out[35]:
No     88208
Yes    22314
Name: No-show, dtype: int64
In [36]:
df['Attendance'].value_counts()
Out[36]:
Attended    88208
Missed      22314
Name: Attendance, dtype: int64
In [37]:
df['No-show'] = df['No-show'].apply(lambda x: 0 if x == 'No' else 1);

Converted the 'No-show' column so as 1 denotes 'Yes' and 0 denotes 'No'

Get geolocation data for plotting Neighbourhoods

Obtain a list of all the neighbourhoods in the dataset

In [38]:
neighbourhood_list = df['Neighbourhood'].unique()
In [39]:
neighbourhood_list
Out[39]:
array(['JARDIM DA PENHA', 'MATA DA PRAIA', 'PONTAL DE CAMBURI',
       'REPÚBLICA', 'GOIABEIRAS', 'ANDORINHAS', 'CONQUISTA',
       'NOVA PALESTINA', 'DA PENHA', 'TABUAZEIRO', 'BENTO FERREIRA',
       'SÃO PEDRO', 'SANTA MARTHA', 'SÃO CRISTÓVÃO', 'MARUÍPE',
       'GRANDE VITÓRIA', 'SÃO BENEDITO', 'ILHA DAS CAIEIRAS',
       'SANTO ANDRÉ', 'SOLON BORGES', 'BONFIM', 'JARDIM CAMBURI',
       'MARIA ORTIZ', 'JABOUR', 'ANTÔNIO HONÓRIO', 'RESISTÊNCIA',
       'ILHA DE SANTA MARIA', 'JUCUTUQUARA', 'MONTE BELO',
       'MÁRIO CYPRESTE', 'SANTO ANTÔNIO', 'BELA VISTA', 'PRAIA DO SUÁ',
       'SANTA HELENA', 'ITARARÉ', 'INHANGUETÁ', 'UNIVERSITÁRIO',
       'SÃO JOSÉ', 'REDENÇÃO', 'SANTA CLARA', 'CENTRO', 'PARQUE MOSCOSO',
       'DO MOSCOSO', 'SANTOS DUMONT', 'CARATOÍRA', 'ARIOVALDO FAVALESSA',
       'ILHA DO FRADE', 'GURIGICA', 'JOANA D´ARC', 'CONSOLAÇÃO',
       'PRAIA DO CANTO', 'BOA VISTA', 'MORADA DE CAMBURI', 'SANTA LUÍZA',
       'SANTA LÚCIA', 'BARRO VERMELHO', 'ESTRELINHA', 'FORTE SÃO JOÃO',
       'FONTE GRANDE', 'ENSEADA DO SUÁ', 'SANTOS REIS', 'PIEDADE',
       'JESUS DE NAZARETH', 'SANTA TEREZA', 'CRUZAMENTO',
       'ILHA DO PRÍNCIPE', 'ROMÃO', 'COMDUSA', 'SANTA CECÍLIA',
       'VILA RUBIM', 'DE LOURDES', 'DO QUADRO', 'DO CABRAL', 'HORTO',
       'SEGURANÇA DO LAR', 'ILHA DO BOI', 'FRADINHOS', 'NAZARETH',
       'AEROPORTO', 'ILHAS OCEÂNICAS DE TRINDADE', 'PARQUE INDUSTRIAL'],
      dtype=object)
In [40]:
country = ', BRAZIL'
with_country = [neighbourhood + country for neighbourhood in neighbourhood_list]

Using the Geopy library obtain the latitude and longitude information of the neighbourhoods

In [41]:
!conda config --set ssl_verify False
In [42]:
%%time
from geopy.geocoders import Nominatim

locator = Nominatim(user_agent="kenneth_API") ##
geopy.geocoders.options.default_timeout = 1000


LAT = []
LONG = []
for x in with_country:
    geolocation = locator.geocode(x, timeout=1000) 
    if geolocation is None:  #checks whether the name matches to a location
        LAT.append(None)     #appends a nullvalue when location not foundd
        LONG.append(None)
    else:
        latitude = geolocation.latitude  #appends geodata to list
        longitude = geolocation.longitude  
        
    LAT.append(latitude)
    LONG.append(longitude);
CPU times: user 206 ms, sys: 8.21 ms, total: 214 ms
Wall time: 40.4 s

Create a dataframe to store the location data

In [43]:
df_location_data = pd.DataFrame(list(zip(neighbourhood_list, LAT, LONG)), columns=['Neighbourhood', 'Latitude','Longitude'])
In [44]:
df_location_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81 entries, 0 to 80
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Neighbourhood  81 non-null     object 
 1   Latitude       80 non-null     float64
 2   Longitude      80 non-null     float64
dtypes: float64(2), object(1)
memory usage: 2.0+ KB

Check for missing values

In [45]:
df_location_data['Latitude'].isnull()
Out[45]:
0     False
1     False
2     False
3     False
4     False
      ...  
76    False
77    False
78    False
79     True
80    False
Name: Latitude, Length: 81, dtype: bool
In [46]:
df_location_data.tail(10)
Out[46]:
Neighbourhood Latitude Longitude
71 DO QUADRO -20.316869 -40.349843
72 DO CABRAL -23.580160 -46.658208
73 HORTO -20.309364 -40.310522
74 SEGURANÇA DO LAR -20.263117 -40.296360
75 ILHA DO BOI -3.164690 -58.202112
76 FRADINHOS -20.307131 -40.326980
77 NAZARETH -20.310290 -40.316110
78 AEROPORTO -3.775718 -38.527795
79 ILHAS OCEÂNICAS DE TRINDADE NaN NaN
80 PARQUE INDUSTRIAL -3.775718 -38.527795
  • ILHAS OCEÂNICAS DE TRINDADE is not accesible through the Geopy library and therefore has null values
  • According to Wikipedia, refers to an archipelago east of the coast of Brazil and forms part of the Brazilian state of Espírito Santo.
  • The archipelago consists of five islands and several rocks and stacks; Trindade is the largest island, with an area of 10.1 square kilometres (3.9 square miles).
  • For the purpose of plotting, the coordinates of the main island in the archipelago, 'Ilha de Trinidade' are going to be used inplace.
  • Latitude : -20.524892
  • Longitude : -29.324559

Drop the neighbourhood from the location data and replace it with "Ilha de Trinidade"

In [47]:
df_location_data = df_location_data[df_location_data.Neighbourhood != 'ILHAS OCEÂNICAS DE TRINDADE']
In [48]:
df_location_data = df_location_data.append(
    {'Neighbourhood': ' ILHA DE TRINDADE','Latitude':-20.524892,'Longitude': -29.324559}, ignore_index=True)
In [49]:
df_location_data.tail(10)
Out[49]:
Neighbourhood Latitude Longitude
71 DO QUADRO -20.316869 -40.349843
72 DO CABRAL -23.580160 -46.658208
73 HORTO -20.309364 -40.310522
74 SEGURANÇA DO LAR -20.263117 -40.296360
75 ILHA DO BOI -3.164690 -58.202112
76 FRADINHOS -20.307131 -40.326980
77 NAZARETH -20.310290 -40.316110
78 AEROPORTO -3.775718 -38.527795
79 PARQUE INDUSTRIAL -3.775718 -38.527795
80 ILHA DE TRINDADE -20.524892 -29.324559
In [50]:
df.query('Neighbourhood == "ILHAS OCEÂNICAS DE TRINDADE"') #check records with that value
Out[50]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes ... Handcap SMS_received No-show App_weekday Part_of _week App_dayname Schdl_hour Part_of_day waiting_days Attendance
48754 5.349869e+11 5583947 F 2016-04-14 12:25:43+00:00 2016-05-13 00:00:00+00:00 51 ILHAS OCEÂNICAS DE TRINDADE 0 0 0 ... 0 0 1 4 weekday Friday 12 Morning 29 Missed
48765 7.256430e+12 5583948 F 2016-04-14 12:26:13+00:00 2016-05-13 00:00:00+00:00 58 ILHAS OCEÂNICAS DE TRINDADE 0 0 0 ... 0 0 1 4 weekday Friday 12 Morning 29 Missed

2 rows × 21 columns

In [51]:
df.loc[48754, "Neighbourhood"] = "ILHA DE TRINIDADE" #get indexes and use loc method to change
df.loc[48765, "Neighbourhood"] = "ILHA DE TRINIDADE"
In [52]:
df.loc[48754]
Out[52]:
PatientId                       5.34987e+11
AppointmentID                       5583947
Gender                                    F
ScheduledDay      2016-04-14 12:25:43+00:00
AppointmentDay    2016-05-13 00:00:00+00:00
Age                                      51
Neighbourhood             ILHA DE TRINIDADE
Scholarship                               0
Hipertension                              0
Diabetes                                  0
Alcoholism                                0
Handcap                                   0
SMS_received                              0
No-show                                   1
App_weekday                               4
Part_of _week                       weekday
App_dayname                          Friday
Schdl_hour                               12
Part_of_day                         Morning
waiting_days                             29
Attendance                           Missed
Name: 48754, dtype: object
In [53]:
df.query('Neighbourhood == "ILHA DE TRINIDADE"') #confirm the neighbourhood value is changed
Out[53]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes ... Handcap SMS_received No-show App_weekday Part_of _week App_dayname Schdl_hour Part_of_day waiting_days Attendance
48754 5.349869e+11 5583947 F 2016-04-14 12:25:43+00:00 2016-05-13 00:00:00+00:00 51 ILHA DE TRINIDADE 0 0 0 ... 0 0 1 4 weekday Friday 12 Morning 29 Missed
48765 7.256430e+12 5583948 F 2016-04-14 12:26:13+00:00 2016-05-13 00:00:00+00:00 58 ILHA DE TRINIDADE 0 0 0 ... 0 0 1 4 weekday Friday 12 Morning 29 Missed

2 rows × 21 columns

Create the observation 'Pre-existing condition'

This will display 0 if the patient lacks all four observations (Hipertension, Diabetes, Alcoholism and Handcap) and 1 if the have any of the four observations

In [54]:
df['Pre_existing_conditions'] = df['Hipertension'] + df['Diabetes']+ df['Alcoholism']+ df['Handcap']
In [55]:
df['Pre_existing_conditions'] = df['Pre_existing_conditions'].apply(lambda x:1 if x >= 1 else 0)
In [56]:
df['Pre_existing_conditions'].value_counts()
Out[56]:
0    84112
1    26410
Name: Pre_existing_conditions, dtype: int64
In [57]:
df.head()
Out[57]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes ... SMS_received No-show App_weekday Part_of _week App_dayname Schdl_hour Part_of_day waiting_days Attendance Pre_existing_conditions
0 2.987250e+13 5642903 F 2016-04-29 18:38:08+00:00 2016-04-29 00:00:00+00:00 62 JARDIM DA PENHA 0 1 0 ... 0 0 4 weekday Friday 18 Evening 0 Attended 1
1 5.589978e+14 5642503 M 2016-04-29 16:08:27+00:00 2016-04-29 00:00:00+00:00 56 JARDIM DA PENHA 0 0 0 ... 0 0 4 weekday Friday 16 Afternoon 0 Attended 0
2 4.262962e+12 5642549 F 2016-04-29 16:19:04+00:00 2016-04-29 00:00:00+00:00 62 MATA DA PRAIA 0 0 0 ... 0 0 4 weekday Friday 16 Afternoon 0 Attended 0
3 8.679512e+11 5642828 F 2016-04-29 17:29:31+00:00 2016-04-29 00:00:00+00:00 8 PONTAL DE CAMBURI 0 0 0 ... 0 0 4 weekday Friday 17 Evening 0 Attended 0
4 8.841186e+12 5642494 F 2016-04-29 16:07:23+00:00 2016-04-29 00:00:00+00:00 56 JARDIM DA PENHA 0 1 1 ... 0 0 4 weekday Friday 16 Afternoon 0 Attended 1

5 rows × 22 columns

Rearrange the columns to have 'No-show' and 'Attendance' at the end of the dataframe

In [58]:
cols = df.columns.tolist()
print(cols)
['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay', 'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism', 'Handcap', 'SMS_received', 'No-show', 'App_weekday', 'Part_of _week', 'App_dayname', 'Schdl_hour', 'Part_of_day', 'waiting_days', 'Attendance', 'Pre_existing_conditions']
In [59]:
 df = df[['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay',
          'AppointmentDay','App_weekday','Part_of _week',
          'App_dayname', 'Schdl_hour', 'Part_of_day', 'waiting_days', 'Age', 'Neighbourhood', 'Scholarship',
          'Hipertension', 'Diabetes', 'Alcoholism', 'Handcap','Pre_existing_conditions',
          'SMS_received', 'No-show', 'Attendance']]

Plot the Neighbourhoods on a map

In [60]:
import plotly.express as px

fig = px.line_geo(lat=[0,15,20,35], lon=[5,10,25,30])
fig = px.scatter_mapbox(
    df_location_data,  # Our DataFrame
    lat="Latitude",
    lon="Longitude",
    center={"lat":-4.047995, "lon":-40.864349},
   
    width=1000,  # Width of map
    height=1000,  # Height of map
    hover_data=["Neighbourhood"])

fig.update_layout(mapbox_style="open-street-map")


fig.show()
  • Most neighbourhoods are located in states that are in the east of Brazil and share the coastline
  • The neighbourhoods are all in Brazil
  • Ilha de Trinidade is an island in the Pacific Ocean, and only 2 appoitments from this area are recorded in the dataset.
In [61]:
df.head()
Out[61]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay App_weekday Part_of _week App_dayname Schdl_hour Part_of_day ... Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap Pre_existing_conditions SMS_received No-show Attendance
0 2.987250e+13 5642903 F 2016-04-29 18:38:08+00:00 2016-04-29 00:00:00+00:00 4 weekday Friday 18 Evening ... JARDIM DA PENHA 0 1 0 0 0 1 0 0 Attended
1 5.589978e+14 5642503 M 2016-04-29 16:08:27+00:00 2016-04-29 00:00:00+00:00 4 weekday Friday 16 Afternoon ... JARDIM DA PENHA 0 0 0 0 0 0 0 0 Attended
2 4.262962e+12 5642549 F 2016-04-29 16:19:04+00:00 2016-04-29 00:00:00+00:00 4 weekday Friday 16 Afternoon ... MATA DA PRAIA 0 0 0 0 0 0 0 0 Attended
3 8.679512e+11 5642828 F 2016-04-29 17:29:31+00:00 2016-04-29 00:00:00+00:00 4 weekday Friday 17 Evening ... PONTAL DE CAMBURI 0 0 0 0 0 0 0 0 Attended
4 8.841186e+12 5642494 F 2016-04-29 16:07:23+00:00 2016-04-29 00:00:00+00:00 4 weekday Friday 16 Afternoon ... JARDIM DA PENHA 0 1 1 0 0 1 0 0 Attended

5 rows × 22 columns

Exploratory Data Analysis

Question 1: Which gender has the highest number of missed appointments?

Check the distribution of overall attendance

In [62]:
Attended = df.Attendance == 'Attended' #create masks
Missed = df.Attendance == 'Missed'

Get the number of unique Patient Id's and Appointment Id's

In [63]:
df['PatientId'].nunique() #unique patient Id's
Out[63]:
62299
In [64]:
df['AppointmentID'].nunique() #unique number of Appointment Id's
Out[64]:
110522
  • There are 62299 unique Patient Id's that are linked to 110522 medical appointments.
In [65]:
df.Attendance.value_counts()
Out[65]:
Attended    88208
Missed      22314
Name: Attendance, dtype: int64
In [66]:
plt.rc('font', size=15)
plt.rcParams['figure.figsize'] = [25, 7]

plt.bar(x=(df.Attendance.value_counts().index),height=df.Attendance.value_counts(), color='#c3d5e8')
plt.title('Appointment Attendance')

plt.box(False)
plt.grid(True)
plt.legend()
plt.show()
  • Majority of appointments are attended.
In [67]:
attendance_by_percentage=(df.Attendance.value_counts()/df.Attendance.value_counts().sum())*100
attendance_by_percentage
Out[67]:
Attended    79.810354
Missed      20.189646
Name: Attendance, dtype: float64
In [68]:
fig = plt.figure(figsize =(8, 8),tight_layout=False)
plt.pie(attendance_by_percentage, labels =('Attended','Missed'), colors=('turquoise','lime'),autopct='%1.1f%%');
plt.title('Attendance %')
plt.legend()
plt.show()
  • Only 20.2% of all appointments are missed.

Get the number of Appointments by Gender

In [69]:
df.Gender.value_counts() 
print("There are {} appointments attributed to female patients and {} appointments attributed to male patients"
      .format(df.Gender.value_counts().values[0],
             df.Gender.value_counts().values[1]))
There are 71837 appointments attributed to female patients and 38685 appointments attributed to male patients

Plot this distribution

In [70]:
fig = plt.figure(figsize =(10, 8),tight_layout=False)
plt.pie(df.Gender.value_counts(), labels =('Female','Male'), colors=('teal','purple'),autopct='%1.1f%%');
plt.title('Total Gender Distribution %')
plt.legend()
plt.show()
  • The number of female patients is significantly higher at 65% while the percentage of men is 35%
In [71]:
df.Gender[Attended].value_counts() #attended appointments by gender
Out[71]:
F    57246
M    30962
Name: Gender, dtype: int64
In [72]:
print("Female patients attended {} while male patients attended {} of {} appointments"
      .format(df.Gender[Attended].value_counts().values[0],
             df.Gender[Attended].value_counts().values[1],
             df.AppointmentID[Attended].value_counts().sum()))
Female patients attended 57246 while male patients attended 30962 of 88208 appointments
In [73]:
fig = plt.figure(figsize =(10, 8),tight_layout=False)
plt.pie(df.Gender[Attended].value_counts(), labels =('Female','Male'), colors=('teal','purple'),autopct='%1.1f%%');
plt.title('Attended Appointments %')
plt.show()
  • The ratio of attendance between female (64.9%) and male(35.1%) is similar to the overall gender distribution
  • Females- comprise of 64.9% of attended appointments
  • Males comprise of 35.1% of attended appointments
  • There are more female patients.
  • Female patients attended more appointments.

Check whether pre-existing conditions contribute to the higher number of female patients

In [74]:
with_pre_existing = df['Pre_existing_conditions'] == 1 #create mask
In [75]:
df.Age[with_pre_existing]
Out[75]:
0         62
4         56
5         76
25        46
26        45
          ..
110483    60
110492    33
110496    37
110499    66
110515    33
Name: Age, Length: 26410, dtype: int64
In [76]:
by_male = df.Gender == 'M' #create mask
by_female = df.Gender == 'F'
In [77]:
df.Pre_existing_conditions.value_counts() #value counts for Pre-existing conditions
Out[77]:
0    84112
1    26410
Name: Pre_existing_conditions, dtype: int64
In [78]:
print("There are {} patients with one or more pre-existing conditions while {} patients lack any pre-existing conditions"
      .format(df.Pre_existing_conditions.value_counts().values[1],df.Pre_existing_conditions.value_counts().values[0]))
There are 26410 patients with one or more pre-existing conditions while 84112 patients lack any pre-existing conditions
In [79]:
fig, ax = plt.subplots(1,2, figsize=(20,8))



ax[0].scatter(x=df.Age[by_female],
              y=(df['Diabetes']+df['Hipertension']+df['Alcoholism']+df['Handcap'])[by_female],
              alpha=0.8, color='teal', label='Female')
ax[0].set_xlabel("Age")
ax[0].set_ylabel("Number of Pre-existing conditions")
ax[0].legend()

ax[0].set_title("Distribution of Pre-existing Between Female and Male by Age")
ax[1].set_title("Distribution of Pre-existing Between Female and Male by Age")

ax[1].scatter(x=df.Age[by_male],
              y=(df['Diabetes']+df['Hipertension']+df['Alcoholism']+df['Handcap'])[by_male],
              alpha=0.5, color='magenta',label='Male')
ax[1].set_xlabel("Age")
ax[1].set_ylabel("Number of Pre-existing conditions")
ax[1].legend()

plt.show()
  • Patients under 20 generally have only have one pre-existing condition.
  • There is an increase in the number of pre-existing conditions as ge progresses.
  • Patients who posses 3 pre-existing conditions, are found between the ages of 40 and 80.

Check to see if there is a relationship between Age and Gender

In [80]:
np.mean(df.Age[by_male]), np.mean(df.Age[by_female]) #mean ages by gender
Out[80]:
(33.737443453534965, 38.89391260770912)
In [81]:
df.Age[by_male].describe() #descriptive statistics of age among men
Out[81]:
count    38685.000000
mean        33.737443
std         24.435465
min          0.000000
25%         10.000000
50%         33.000000
75%         54.000000
max        100.000000
Name: Age, dtype: float64
In [82]:
df.Age[by_female].describe() #descriptive statistics of age among women
Out[82]:
count    71837.000000
mean        38.893913
std         22.154926
min         -1.000000
25%         21.000000
50%         39.000000
75%         56.000000
max        115.000000
Name: Age, dtype: float64
  • Female patients have a mean age of 38.89
  • Male patients have a mean age of 33.74

Plot distribution of age based on gender

In [83]:
plt.rc('font', size=15)
fig = plt.figure(figsize =(10, 8))


plt.hist(df.Age[by_male], bins=10, alpha=0.5, label='Male', color='purple',edgecolor='black')
plt.hist(df.Age[by_female], bins=10, alpha = 0.5, label='Female', color='teal',edgecolor='black')

plt.axvline(df.Age[by_male].mean(), color='y', linestyle='dashed', linewidth=5, label='Male mean')
plt.axvline(df.Age[by_female].mean(), color='r', linestyle='dashed', linewidth=5, label='Female mean')

plt.axvline(df.Age[by_male].median(), color='y', linestyle='dashdot', linewidth=1, label='Male median')
plt.axvline(df.Age[by_female].median(), color='r', linestyle='dashdot', linewidth=1, label='Female median')

x_ticks = [0,10,20,30,40,50,60,70,80,90,100,110,120]
plt.title('Age Distribution By Gender')

plt.legend()
plt.show();
In [84]:
from scipy.stats import skew
print("The skewness for Age of male patients is {}, and female patients is {}"
      .format(skew(df.Age[by_male]),
              skew(df.Age[by_female])))
The skewness for Age of male patients is 0.24956763639919374, and female patients is 0.08626391004581227
  • There are overall more female patients.
  • There a more male than female children under the age of 10. This is the only age bracket where the number of males is higher than females
  • Male age distribution is more skewed to the right than female age distribution. This signifies that the difference between the mean and median is greater in males than in females.
In [85]:
df.Gender[Missed].value_counts() #missed appointmnets by gender
Out[85]:
F    14591
M     7723
Name: Gender, dtype: int64

Plot bar graph showing distribution of missed appointments by gender

In [86]:
plt.figure(figsize=(25,10))
plt.bar(x=(df.Gender[Missed].value_counts().index),height=df.Gender[Missed].value_counts(), color='teal')
plt.title('Distribution of Missed Appointments By Gender', fontsize=15)
plt.xlabel='Gender'
plt.ylabel='Missed Appointments'


plt.box(False)
plt.grid(True)
plt.legend()
plt.show()

Plot pie chart showing distribution of missed appointments by gender

In [87]:
fig = plt.figure(figsize =(10, 8),tight_layout=False)
plt.pie(df.Gender[Missed].value_counts(), labels =('Female','Male'), colors=('teal','purple'),autopct='%1.1f%%');
plt.title('Missed Appointments %')
plt.show()
  • The number of missed appointments by female patients: 14591
  • The numbr of missed appointments by male patients: 7723

  • Females comprise of 65.4% of missed appointments

  • Males comprise of 34.6% of missed appointments
  • Gender does not seem to affect attendance.
  • Female patients missed more appointments.

Question 2: Are patients who are recipients of welfare likely to attend appointments?

In [88]:
df.Scholarship.value_counts()
Out[88]:
0    99661
1    10861
Name: Scholarship, dtype: int64
In [89]:
print("There are {} appointments of patients enrolled in the welfare program while the rest {} appointments are of patients not enrolled"
      .format(df.Scholarship.value_counts().values[1],
             df.Scholarship.value_counts().values[0]))
There are 10861 appointments of patients enrolled in the welfare program while the rest 99661 appointments are of patients not enrolled

Plot bar graph to show welfare distribution among appointments

In [90]:
plt.figure(figsize=(10,7))
plt.bar(x=(df.Scholarship.value_counts().index[0]),height=df.Scholarship.value_counts().values[0], color='cyan', label='Non-Recipient')
plt.bar(x=(df.Scholarship.value_counts().index[1]),height=df.Scholarship.value_counts().values[1], color='plum', label='Recipient')
plt.title('No. of Appointments of Patients Receiving Welfare', fontsize=15)
plt.ylabel='Appointments'
ax = plt.gca()
ax.get_xaxis().set_visible(False)
plt.box(False)
plt.grid(True)
plt.legend()
plt.show()

Get the number of patients enrolled in the welfare program

In [91]:
with_welfare = df.Scholarship == 1 #receive welfare
no_welfare = df.Scholarship == 0 #not enrolled in welfare program
In [92]:
len(df.PatientId[with_welfare].unique()), len(df.PatientId[no_welfare].unique()) #apply condition to dataset
Out[92]:
(5788, 56511)
In [93]:
patient_by_welfare=np.array([len(df.PatientId[with_welfare].unique()), len(df.PatientId[no_welfare].unique())], dtype='int64')
patient_by_welfare #contains distribution of welfare among patients
Out[93]:
array([ 5788, 56511])

Plot a pie chart to show the percentages of this distribution

In [94]:
fig = plt.figure(figsize =(10, 8),tight_layout=False)
plt.pie(patient_by_welfare, labels =('Recipient','Non-recipient'), colors=('indigo','pink'),autopct='%1.1f%%');
plt.title('Welfare Enrollment %')
plt.legend()
plt.show()
  • There are 5788 patients enrolled in the welfare.
  • There are 56511 patients not enrolled in the welfare program.

Check the relationship between enrolment in the welfare program and pre-existing conditions

In [95]:
df.Pre_existing_conditions[with_welfare].value_counts() # 0 for false, 1 for true
Out[95]:
0    8400
1    2461
Name: Pre_existing_conditions, dtype: int64
In [96]:
df.Pre_existing_conditions[no_welfare].value_counts() # 0 for false, 1 for true
Out[96]:
0    75712
1    23949
Name: Pre_existing_conditions, dtype: int64
In [97]:
fig, ax = plt.subplots(2,1)
fig = plt.figure(figsize =(20, 20),tight_layout=False)
ax[0].pie(df.Pre_existing_conditions[with_welfare].value_counts(),
        labels =('No pre-existing conditions','Have Pre-existing conditions'), 
        colors=('darkorange','antiquewhite'),autopct='%1.1f%%')
ax[0].set_title('Pre-existing conditions among welfare recipients %', fontsize=15);

ax[1].pie(df.Pre_existing_conditions[no_welfare].value_counts(),
          labels =('No pre-existing conditions','Have Pre-existing conditions'),
          colors=('darkorange','antiquewhite'),
          autopct='%1.1f%%');
ax[1].set_title('Pre-existing conditions among non-recipients %', fontsize=15);
<matplotlib.figure.Figure at 0x7fabe7399908>
  • Patients enrolled in the welfare programs have less pre-existing conditions(Diabetes,Hipertension, Alcoholism, Handicap) when compared to patients who are not recipients of the program.

Investigate Appointment Attendance Among Welfare Recipients

In [98]:
df.Attendance[with_welfare].value_counts() #attendance of patients receiving welfare
Out[98]:
Attended    8283
Missed      2578
Name: Attendance, dtype: int64

Plot a bar graph to show the distribution

In [99]:
plt.figure(figsize=(25,10))
plt.bar(x=(df.Attendance[with_welfare].value_counts().index),height=df.Attendance[with_welfare].value_counts(), color='teal')
plt.title('Attendance Patients Receiving Welfare', fontsize=20)
plt.xlabel='Gender'
plt.ylabel='Appointments'
plt.box(False)
plt.grid(True)
plt.legend()
plt.show()
In [100]:
fig = plt.figure(figsize =(10, 8),tight_layout=False)
plt.pie(df.Attendance[with_welfare].value_counts(), labels =('Attended','Missed'), colors=('teal','lightskyblue'),autopct='%1.1f%%');
plt.title('Attendance Patients Receiving Welfare %')
plt.legend()
plt.show()
In [101]:
fig = plt.figure(figsize =(10, 8),tight_layout=False)
plt.pie(df.Attendance[no_welfare].value_counts(), labels =('Attended','Missed'), colors=('teal','lightskyblue'),autopct='%1.1f%%');
plt.title('Attendance Patients Not Receiving Welfare %')
plt.legend()
plt.show()
  • Patients not enrolled in welfare attended more appointments than those enrolled in the welfare programs.

Check the differences in attendance along age between men and women enrolled in welfare

In [102]:
by_age_attended_welfare = df.query('(Attendance == "Attended") & (Scholarship == 1)').groupby('Age')['PatientId'].count()
by_age_attended_welfare_fem = df.query('(Attendance == "Attended") & (Scholarship == 1) & (Gender == "F")').groupby('Age')['PatientId'].count()
by_age_attended_welfare_male = df.query('(Attendance == "Attended") & (Scholarship == 1) & (Gender == "M")').groupby('Age')['PatientId'].count()
In [103]:
by_age_attended_welfare.plot(label="Total")
by_age_attended_welfare_fem.plot(label="Female")
by_age_attended_welfare_male.plot(label="Male")
plt.title('Attendance count vs Age (Welfare Recipients)')
plt.legend()
plt.show();

-There are more male patients between 0 and 20, their number declines afterwards with a sharp rise and fall at around 60.

In [104]:
(df.query('(Scholarship == 1) & (Gender == "F")').groupby('PatientId')['AppointmentID'].count())
(df.query('(Scholarship == 1) & (Gender == "M")').groupby('PatientId')['AppointmentID'].count())
Out[104]:
PatientId
6.723485e+08    1
6.892998e+08    1
8.645267e+08    2
2.237714e+09    5
3.539433e+09    1
               ..
9.979459e+14    1
9.993129e+14    1
9.993364e+14    2
9.997482e+14    1
9.999465e+14    1
Name: AppointmentID, Length: 1087, dtype: int64
In [ ]:
 

Question 3: What day of the week has the most missed attendance?

Check the distribution of scheduling of appointments

In [105]:
df['App_dayname'].value_counts() #distribution of appointments by weekday`
Out[105]:
Wednesday    25866
Tuesday      25638
Monday       22714
Friday       19019
Thursday     17246
Saturday        39
Name: App_dayname, dtype: int64
In [106]:
df['App_dayname'].value_counts().plot(kind="barh",
                                      ylabel='Days of week',
                                      xlabel='No. of appointments',
                                      title='Distribution of Appointments',
                                      color='teal',
                                      figsize=(8,6)
                                     )
Out[106]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fabe7399390>
  • Saturday has the least number of appointments.
  • There are no appointments on Sundays.
  • The days ranked by total number of appointments in descending order are:
      1. Wednesday
      1. Tuesday
      1. Monday
      1. Friday
      1. Thursday
      1. Saturday
In [107]:
Attended = df.Attendance == 'Attended'
Missed = df.Attendance == 'Missed'
In [108]:
df.App_dayname[Attended].value_counts()
Out[108]:
Wednesday    20774
Tuesday      20488
Monday       18025
Friday       14982
Thursday     13909
Saturday        30
Name: App_dayname, dtype: int64
In [109]:
df.App_dayname[Missed].value_counts()
Out[109]:
Tuesday      5150
Wednesday    5092
Monday       4689
Friday       4037
Thursday     3337
Saturday        9
Name: App_dayname, dtype: int64
In [110]:
df.App_dayname[Attended].value_counts().sort_values()
Out[110]:
Saturday        30
Thursday     13909
Friday       14982
Monday       18025
Tuesday      20488
Wednesday    20774
Name: App_dayname, dtype: int64
In [111]:
plt.rcParams['figure.figsize'] = [25, 15]

        
plt.bar(x=(df.App_dayname[Attended].value_counts().index),
        height=df.App_dayname[Attended].value_counts(),
        label='Attended', color='#24615D')
        
plt.bar(x=(df.App_dayname[Missed].value_counts().index),
        height=df.App_dayname[Missed].value_counts(),
        label='Missed', color='#00FFEF')

df['App_dayname'].value_counts()
plt.title('Appointment Attendance')

plt.box(False)
plt.grid(True)
plt.legend()
plt.show()
  • Patients generally show up for appointments.
  • Excluding Saturday, Thursday has the least number of missed appointments, followed by Friday. This may be due to their close proximity to the weekend.
  • The trend is appointment attendance is the same in comparison to the total number of appointments for that day. This means Appointment Day does not affect attendance.

Check for any relationship between attendance and scheduling

In [112]:
df['Schdl_hour'].value_counts().plot.bar(figsize=(10,8),
                                         color='g',
                                        xlabel='Hours of the Day',
                                        ylabel=('Scheduling of Appointments'),
                                        title=('Scheduling of Appointments vs Time'));

Check Attendance when compared to registration/scheduling of appointments

In [113]:
by_percentange = (df['Part_of_day'][Attended].value_counts() / df['Part_of_day'].value_counts()) * 100 #attendance as a percentage by scheduling time
by_percentange
Out[113]:
Early morning    82.551775
 Morning         78.062472
Afternoon        76.880468
Noon             79.090104
Evening          76.204259
Night            66.666667
Name: Part_of_day, dtype: float64
In [114]:
by_percentange.plot(kind="barh",
                   title="Percentage Attendance");
  • Noon has considerably less scheduling occuring in comparison to the hours, before and after. This is due to Noon being attributed a single, that occurs between(12:00 and 13:00).
In [115]:
df['Part_of_day'][Attended].value_counts()
Out[115]:
Early morning    40419
 Morning         19468
Afternoon        17488
Noon              7145
Evening           3686
Night                2
Name: Part_of_day, dtype: int64
In [116]:
df['Part_of_day'][Attended].value_counts().plot.bar(figsize=(8,6),
                                                    color='g',
                                        xlabel='Part of Day',
                                        ylabel=('Attended Appointments'),
                                        title=('Attended Appointments vs Scheduling:Part of day'));
In [117]:
df['Part_of_day'][Missed].value_counts() #missed appointments categorized by scheduling
Out[117]:
Early morning    8543
 Morning         5471
Afternoon        5259
Noon             1889
Evening          1151
Night               1
Name: Part_of_day, dtype: int64
In [118]:
df['Part_of_day'][Attended].value_counts().plot.bar(figsize=(8,6),
                                                    color='g',
                                        xlabel='Scheduling:Part of Day',
                                        ylabel=('Missed Appointments'),
                                        title=('Missed Appointments vs Scheduling'));
  • Most appointments are registered in the morning.
  • No scheduling occurs between 1 PM and 5 PM.
  • No scheduling occurs past 10 PM.
  • 7 am has the highest amount of scheduling occuring.
  • People are least likely to schedule appointments during the night.
  • There is less activity in terms of scheduling appointments as the day progresses.

Question 4: Does the number of waiting days affect attendance?

In [119]:
plt.figure(figsize=(8,6))
plt.hist((df.waiting_days[Attended]), label='Attended', alpha=0.8)
plt.hist((df.waiting_days[Missed]), label='Missed', alpha=0.7)
plt.grid(True)
plt.box(False)
plt.title('Waiting Days Before Appointment')
plt.xlabel="No. of days"
plt.ylabel="No. of Appointments"

plt.legend()
plt.show()
  • The majority of waiting days are between 0 and 25 for attended appointments.
  • The trend of waiting days is similar in both attended and missed appointments.
In [120]:
df.waiting_days[Attended].unique() #display range of missing days
Out[120]:
array([  0,   2,   1,   3,   4,   9,  23,  11,  10,  18,  17,  14,  28,
        24,  21,  15,  43,  30,  29,  22,  42,  32,  31,  56,  45,  46,
        39,  37,  38,  44,  52,  65,  67,  91,  66,  84,  78, 115, 109,
        70,  57,  16,  58,  63,  50,  51,  41,  73,  59,  49,  20,  34,
         6,  33,  35,  36,  12,  40,   8,   5,  25,   7,  48,  27,  47,
        53,  13,  62,  55,  19, 176,  54,  77,  83,  76,  89, 103,  81,
        26,  72,  60,  79,  68,  61,  85,  64, 112,  86,  98,  94, 142,
       162, 169, 104, 133, 125, 155,  96,  69,  90, 127, 119,  74,  71,
        88,  82, 108, 110, 102, 111, 122, 101, 105,  92,  75,  87,  80,
        97,  93, 107,  95, 179, 117, 123])

Check mean for waiting days categorized by attendance

In [121]:
np.mean(df.waiting_days),np.mean(df.waiting_days[Attended]),np.mean(df.waiting_days[Missed])
Out[121]:
(10.18425290892311, 8.754659441320515, 15.835484449224701)
In [122]:
fig = plt.figure(figsize =(10, 8),tight_layout=False)
y=np.array([9,16])
category = ('Attended','Missed')
explosion = [0.2, 0]
col = ('mediumspringgreen','plum')
plt.title('Mean of Waiting Days Before Appointment')
plt.pie(y, labels = category,explode=explosion, colors=col,shadow = True, startangle = 90)
plt.show()
In [123]:
print("The mean for all waiting days is {},\nthe mean for waiting days for attended appointments is {}\nand the mean for waiting days of missed appointments is {}"
      .format((np.mean(df.waiting_days).round(0)),
             np.mean(df.waiting_days[Attended]).round(0),
             np.mean(df.waiting_days[Missed]).round(0)))
The mean for all waiting days is 10.0,
the mean for waiting days for attended appointments is 9.0
and the mean for waiting days of missed appointments is 16.0
  • The mean number of waiting days is 10 for all appointments.
  • Attended appointments have a smaller mean of about 9 days
  • Missed appointments have a mean number of about 16 days.
  • Appointments with shorter number of waiting days have a higher attendance.

Conclusions

Female patients missed more appointments.

  • The number of missed appointments by female patients: 14591
  • The number of missed appointments by male patients: 7723
  • Female patients comprise of 65.4% of missed appointments.
  • Males comprise of 34.6% of missed appointments.
  • Majority of appointments are attended, with only 20.2% of appointments being missed.
  • There are 71837 appointments attributed to female patients and 38685 appointments attributed to male patients
  • The number of female patients is significantly higher at 65% while the percentage of men is 35%.The ratio of attendance between female (64.9%) and male(35.1%) is similar to the overall gender distribution This lessens the impact of gender on attendance.
  • There are 62299 unique patient id's that are linked to 110522 appointments.
  • This means their exists repeat patients who had more than one appointment.
  • There are 54154 uniue patient id's for attended appointments,and 17661 unique patient id's for missed appointments. This means that an overlap occurs where the same patients appear in both categories. It can be inferred that some of the patients who had more than one appointment,attended some appointments and missed others.
  • There are 26410 patients with one or more pre-existing conditions(diseases) while 84112 patients lack any pre-existing conditions.
  • Patients under 20 generally have only have one pre-existing condition.
  • There is an increase in the number of pre-existing conditions as ge progresses.
  • Patients who posses 3 pre-existing conditions, are found between the ages of 40 and 80.
  • Female patients have a mean age of 38.89
  • Male patients have a mean age of 33.74
  • There a more male than female children under the age of 10. This is the only age bracket where the number of males is higher than females
  • Male age distribution is more skewed to the right (skewness:0.25) than female age distribution(skewness: 0.09). This signifies that the difference between the mean and median is greater in males than in females.

Patients not enrolled in welfare attended more appointments than those enrolled in the welfare programs. Welfare does not have an impact on attendance.

  • There are 10861 appointments of patients enrolled in the welfare program while the rest 99661 appointments are of patients not enrolled. Most appointments are of paients not enrolled in welfare.
  • There are 5788 patients enrolled in the welfare.
  • There are 56511 patients not enrolled in the welfare program.
  • Patients enrolled in the welfare programs have less pre-existing conditions(Diabetes,Hipertension, Alcoholism, Handicap) when compared to patients who are not recipients of the program.

Saturday has the least number of appointments.

  • There are no appointments on Sundays.
  • The days ranked by total number of appointments in descending order are:
          --Wednesday
         -- Tuesday
         -- Monday
         -- Friday
         -- Thursday
          --Saturday
  • Excluding Saturday, Thursday has the least number of missed appointments, followed by Friday. This may be due to their close proximity to the weekend.
  • 7 am has the highest amount of scheduling occuring.
  • People are least likely to schedule appointments during the night.
  • There is less activity in terms of scheduling appointments as the day progresses.
  • The trend in appointment attendance is the same in comparison to the total number of appointments for that day. This means Appointment Day does not affect attendance.
  • Patients generally show up for appointments. This data requires specific time information for appointment days, to get a more accurate representation of attendance. Another question to be asked is, why scheduling of appointments decreases as the day progresses? The reason why Saturday has the least number of appointments regardless of being a weekend, requires further analysis.

Appointments with shorter number of waiting days have a higher attendance.

  • The majority of waiting days are between 0 and 25 for attended appointments.
  • The trend in distribution of waiting days is similar in both attended and missed appointments.
  • The mean number of waiting days is 10 for all appointments.
  • Attended appointments have a smaller mean of about 9 days.
  • Missed appointments have a mean number of about 16 days.

LIMITATIONS

  1. Ilha de Trinidade is an island in the Pacific Ocean, and only 2 appoitments from this area are recorded in the dataset, and both show a missed attendance.The analysis cannot give an accurate representation of the patients in this neighbourhood.
  2. The appointment dates lack the time they were attended, this would assist in analysing attendance and determining which hours of the day or night are busiest in the hospital.

Resources

  1. Wikipedia : https://en.wikipedia.org/wiki/Trindade_and_Martim_Vaz Retrieved on 23-05-2022.
  2. Facts About Bolsa Família : https://thebrazilbusiness.com/article/facts-about-bolsa-familia Retrieved on 23-05-2022.

Submitting your Project

Tip: Before you submit your project, you need to create a .html or .pdf version of this notebook in the workspace here. To do that, run the code cell below. If it worked correctly, you should get a return code of 0, and you should see the generated .html file in the workspace directory (click on the orange Jupyter icon in the upper left).

Tip: Alternatively, you can download this report as .html via the File > Download as submenu, and then manually upload it into the workspace directory by clicking on the orange Jupyter icon in the upper left, then using the Upload button.

Tip: Once you've done this, you can submit your project by clicking on the "Submit Project" button in the lower right here. This will create and submit a zip file with this .ipynb doc and the .html or .pdf version you created. Congratulations!

In [124]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])